Decision effect of a deep-learning model to assist a head computed tomography order for pediatric traumatic brain injury

The study aims to measure the effectiveness of an AI-based traumatic intracranial hemorrhage prediction model in the decisions of emergency physicians regarding ordering head computed tomography (CT) scans. We developed a deep-learning model for predicting traumatic intracranial hemorrhages (DEEPTICH) using a national trauma registry with 1.8 million cases. For simulation, 24 cases were selected from previous emergency department cases. For each case, physicians made decisions on ordering a head CT twice: initially without the DEEPTICH assistance, and subsequently with the DEEPTICH assistance. Of the 528 responses from 22 participants, 201 initial decisions were different from the DEEPTICH recommendations. Of these 201 initial decisions, 94 were changed after DEEPTICH assistance (46.8%). For the cases in which CT was initially not ordered, 71.4% of the decisions were changed (p < 0.001), and for the cases in which CT was initially ordered, 37.2% (p < 0.001) of the decisions were changed after DEEPTICH assistance. When using DEEPTICH, 46 (11.6%) unnecessary CTs were avoided (p < 0.001) and 10 (11.4%) traumatic intracranial hemorrhages (ICHs) that would have been otherwise missed were found (p = 0.039). We found that emergency physicians were likely to accept AI based on how they perceived its safety.

The influence of recommendation directions. Figure 2 shows the flow of participants' responses for the binary head CT decision before and after the DEEPTICH recommendation. Responses in which the initial decision matched the DEEPTICH recommendation (n = 327, 61.9%) were excluded. We analyzed the responses in which the initial decision differed from the DEEPTICH recommendation (n = 201, 38.1%).
Of the 201 responses, 56 initial decisions were not to order a head CT; when DEEPTICH recommended a head CT, 40 of these 56 decisions (71.4%) were changed, i.e., respondents decided to order a head CT (p < 0.001). When DEEPTICH recommended not ordering a head CT, only 54 of the 145 initial decisions to order a CT (37.2%) were changed (p < 0.001).
We analyzed all responses on the five-point head CT ordering willingness scale according to the DEEPTICH recommendation (n = 528). We found considerable score changes and decision augmentations according to the DEEPTICH recommendation. In cases where DEEPTICH advised a head CT, the mean willingness score changed from 3.46 to 3.97 (Δwillingness, 0.51). In cases where DEEPTICH advised not ordering a head CT, the mean score changed from 2.69 to 2.27 (Δwillingness, −0.42). The detailed results are presented in Supplementary Table S2.
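The proportions and score shifts in this subsection follow directly from the reported counts and means. The short sketch below re-derives them as a consistency check; the variable names are illustrative and not part of the study. The second proportion evaluates to 37.2%, the figure given in the abstract.

```python
# Counts and means are taken from the text; names are illustrative only.
discordant = {
    "ai_recommends_ct": {"initial": 56, "changed": 40},
    "ai_recommends_no_ct": {"initial": 145, "changed": 54},
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of discordant decisions that were changed, per recommendation direction.
rates = {k: pct(v["changed"], v["initial"]) for k, v in discordant.items()}

# Overall share of discordant decisions that were changed.
total_initial = sum(v["initial"] for v in discordant.values())
total_changed = sum(v["changed"] for v in discordant.values())
overall = pct(total_changed, total_initial)

# Shift in mean willingness score (five-point scale) after the recommendation.
delta_ct = round(3.97 - 3.46, 2)
delta_no_ct = round(2.27 - 2.69, 2)

print(rates, overall, delta_ct, delta_no_ct)
```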
The physician's factor of influence. When DEEPTICH recommended not ordering a head CT, the decision effect differed by the age and experience of the physician. Relatively inexperienced physicians were more likely to accept the recommendation than experienced physicians (−43 (29%) vs. −11 (7.3%), p < 0.001). Physicians older than 40 years did not change their decisions, whereas physicians in the 30-40 years age group did: 0 (0.0%) vs. −36 (20.0%), p = 0.021 (Table 2).
Factors associated with AI acceptance. We conducted univariate and multivariate logistic regression analyses to identify the factors associated with the effectiveness of DEEPTICH. The participants were more likely to accept AI recommendations when the PECARN risk was high (odds ratio (OR), 15.02; 95% CI 1.60-473.38) and when the initial decision was not to order a head CT (OR, 2.68; 95% CI 1.08-6.88) (Table 3). We also conducted a logistic regression on the survey outcomes; no item was associated with the effectiveness of DEEPTICH.
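For readers less familiar with how odds ratios come out of a logistic regression, the sketch below converts a fitted coefficient and its standard error into an OR with a Wald-type 95% confidence interval. This is only an illustration: the text does not state which interval method was used, and the coefficient and standard error here are hypothetical, not values from the study.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic coefficient and its standard error to (OR, lower, upper).

    Uses a Wald interval: exp(beta +/- z * se). Other interval methods
    (e.g. profile likelihood) give different bounds.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error, NOT taken from the study.
or_, lower, upper = odds_ratio_with_ci(beta=1.0, se=0.5)
print(f"OR {or_:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

A Wald interval is symmetric on the log scale, so a small sample with a large standard error produces a very wide, right-skewed interval on the OR scale.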
DEEPTICH effectiveness by PECARN risk. We analyzed head CT decisions before and after the DEEPTICH recommendation using PECARN risk rules.
Table 5 presents the overall clinical outcomes using DEEPTICH. When using DEEPTICH, 46 (11.6%) unnecessary CTs were avoided (p < 0.001) and 10 (11.4%) traumatic ICHs that would otherwise have been missed were found (p = 0.039).
Survey outcome. The survey outcomes are presented in Table 6.

Discussion
To the best of our knowledge, this is the first study to develop a decision simulation study design and to investigate the acceptance of AI in clinical decisions by physicians. Most AI-based clinical decision support system (AI-CDSS) studies have reported improved diagnostic accuracy or efficiency based on agreement with AI 34,35. However, before considering the effect of an AI approach on accuracy, it is necessary to understand in detail its function in the clinical decision-making process.
In this study, we developed DEEPTICH, a deep-learning model for predicting traumatic ICHs. DEEPTICH had a higher AUROC than previously known pediatric head CT rules 36. Because the rate of traumatic brain injury (TBI) is higher in younger age groups than in older age groups, this difference in data was considered when setting the model threshold, which gave DEEPTICH lower specificity in the younger age group 37,38.
Subsequently, we identified that the effect of AI on physicians' decision making is influenced by various factors, one of which is the recommendation direction (positive vs. negative). We found that when the model's suggestion is positive, emergency physicians are more likely to accept its recommendation, whereas when the suggestion is negative, the decision change differs by the physician's years of work and age, suggesting that inexperienced clinicians are significantly more likely to be influenced by AI tools than experienced clinicians.
We demonstrated that DEEPTICH is effective even when the AI-CDSS recommendation and the initial decision of the physician are the same. When physicians realized that the AI-CDSS concurred with their initial decision, their level of confidence increased significantly. This is important because clinical decisions are often challenged by non-clinical factors, both social and psychological.
As DEEPTICH only predicts ICHs, excluding microhemorrhages, clinicians may be reluctant to adopt its recommendations because microhemorrhages are a clinically important sign of significant diffuse TBI. Although ICH is the most common pediatric TBI requiring neurosurgical intervention 39,40, a more effective and reliable model must also consider the prediction of other abnormalities.
We believe that DEEPTICH can improve clinical outcomes. Overall, DEEPTICH is helpful in reducing unnecessary head CTs and missed ICH cases. Although the model decision effect is not significant in the intermediate-risk group, approximately 70% of children in the low- or high-risk groups of head trauma can benefit from using DEEPTICH through increased head CT ordering in high-risk groups and decreased head CT ordering in low-risk groups.
The survey outcome indicated that physicians were concerned about the clinical safety and information quality of DEEPTICH. Therefore, we propose that for the clinical use of medical AI, development information, such as data processing and modeling, should be described to physicians in greater detail to alleviate their concerns.
Consequently, considering the results and sensitivity of DEEPTICH, we suggest using DEEPTICH together with conventional head CT rules to optimize the prevention of adverse outcomes and unnecessary head CTs. This model can be used to supplement standard head CT rules even if the case history is not filled out or it has been 24 h since the last visit, especially for doctors with less experience.
This study has two limitations. First, the simulation cases are not representative of real-world pediatric TBI populations; there is a greater proportion of low-risk TBI patients in the real world, so the decision effect and clinical performance of the model may be different. Second, the simulation cases were non-randomly selected; in the selected cases, the DEEPTICH results were consistent with the actual outcomes of the real cases. Therefore, we did not evaluate the accuracy and superiority of DEEPTICH in this decision simulation study. When implementing DEEPTICH in a real-world clinical setting, the rate of AI acceptance by physicians might be different.
We found that AI acceptance was affected by multiple factors, such as the characteristics of the physician, the risk of the case, and the recommendation of DEEPTICH, making it difficult to predict the effect of the model in the real world. Therefore, when implementing an AI-CDSS in clinical scenarios, we suggest considering the model performance along with its acceptance by physicians. To assess improvements in clinical outcomes, randomized clinical trials in real-world settings are required.

Conclusions
DEEPTICH affects decisions of emergency physicians to order head CTs, as demonstrated by the decision simulation study. The effectiveness of the model is more significant when the model recommends ordering of head CTs.

Methods
This study was approved by the Institutional Review Board (IRB) of Samsung Medical Center (IRB Nos. 2020-07-072 and 2020-09-218). We conducted the decision simulation study from April 26, 2021 to June 5, 2021. Informed consent was obtained from all participants. We confirm that all the experiments were performed in accordance with the relevant guidelines and regulations.

Deep-learning model for predicting traumatic intracranial hemorrhages (DEEPTICH) development in clinical decision support system (CDSS).
Dataset for deep learning. Two data sources were used in this study: the ED-based injury in-depth surveillance (EDIIS) database, and the trauma registry database of the Samsung Medical Center (SMC). The EDIIS dataset was used for model training and internal validation; the SMC dataset was used for external validation and to investigate the effectiveness of DEEPTICH. Detailed data selection criteria are provided in Supplementary Fig. S1. The EDIIS database was established based on the International Classification of External Causes of Injuries by the World Health Organization. The database includes prehospital records, clinical findings, diagnosis, treatment, dispositions in the ED, inpatient information, demographics, and injury-related factors of the patients. Information on 1.8 million patients from 25 EDs was included in this surveillance database. Each participating hospital assigned coordinators for data collection and management, and the Korea Centers for Disease Control regularly checked the quality of the entire data from the 25 EDs. In this study, the records of 750,000 patients with head injuries from January 1, 2011 to December 31, 2017 were used for derivation and time-split validation.
The SMC database contains medical records from a tertiary academic hospital in South Korea with approximately 2000 beds and an average of 200 ED patient visits per day. This database includes the records of procedures and clinical notes, as well as the same information collected in EDIIS. The SMC dataset was collected from January 1, 2012 to December 31, 2019, with 67,578 patient records. These data were used for multi-center validation in this study.
Model training for deep learning. We used patient demographics, vital signs, mental status, injury-related factors, date and time-related information regarding the injury onset and visit, and symptoms as predictors. Demographics included age and sex. Vital signs included the respiratory rate, body temperature, systolic and diastolic blood pressure, and pulse rate. Mental status data included the "alert, verbal, pain, unresponsive" scale and Glasgow coma scale (GCS) scores. Injury-related factors included the injury mechanism, activity during the injury, alcohol-related factors, intentions, place of the injury, material causing the injury, and the time taken from injury onset to the visit. Time-related predictors included the injury onset and visit date and time information, such as the day of the week and hour of the injury onset.
Multiple outcomes were used for training the model. The primary outcome was ICH, such as cerebral contusion, subdural hemorrhage, epidural hemorrhage, subarachnoid hemorrhage, intraventricular hemorrhage, intracerebral hemorrhage, and cerebellar hemorrhage. Other outcomes, such as TBIs other than ICH, visit dispositions, and operations related to head injuries were considered as secondary outcomes. The purpose of secondary outcomes was to improve the prediction performance of the model in multi-task learning.
Machine learning algorithm. Multi-task learning was used for the ML algorithm to classify the ICHs and secondary outcomes. Multi-task deep learning is a method of training multiple learning tasks simultaneously during the training phase. The advantage of multi-task learning is that it can exploit useful information based on the commonalities and differences in the different tasks during training. In our case, there were commonalities and differences in hemorrhages and other TBIs, visit dispositions (patients with ICHs and serious TBIs are more likely to be admitted to the hospital), and head-injury related operations (some ICHs require acute interventions).
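The idea of training several related tasks together can be illustrated by the objective alone: each task contributes its own loss, and the shared network is updated on their combination. The sketch below shows only that combined loss for one patient, using binary cross-entropy per task. The task names, the weights, and the choice of a weighted sum are assumptions for illustration; the text does not describe the actual loss or network architecture.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multitask_loss(probs, labels, weights):
    """Weighted sum of per-task binary cross-entropies for a single patient."""
    return sum(weights[t] * bce(probs[t], labels[t]) for t in labels)


# Hypothetical tasks and weights: one primary task (ICH) and three secondary tasks.
probs = {"ich": 0.8, "other_tbi": 0.3, "admission": 0.6, "operation": 0.1}
labels = {"ich": 1, "other_tbi": 0, "admission": 1, "operation": 0}
weights = {"ich": 1.0, "other_tbi": 0.5, "admission": 0.5, "operation": 0.5}

loss = multitask_loss(probs, labels, weights)
print(round(loss, 4))
```

In a full model, the four probabilities would come from separate output heads on top of shared layers, so gradients from the secondary tasks also shape the representation used for ICH prediction.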
Algorithm threshold selection. There were numerous options for selecting an appropriate threshold for the binary prediction of DEEPTICH, including the Youden index, the threshold generating the best F-1 score, thresholds giving a sensitivity of 0.97, 0.95, or 0.9, and the mean threshold among groups of populations whose outcome was 1. We generated case examples for each option, and each option was reviewed by a clinician. Finally, we selected the threshold that yielded a negative predictive value of 0.99 in each age group because it best reflected clinical decision-making in real clinical circumstances.
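One way to pick a threshold from a negative predictive value (NPV) target is sketched below: scan candidate thresholds and keep the largest one whose NPV still meets the target. This is a minimal illustration on toy data, not the authors' procedure; the function names, the candidate set (observed probabilities), and the rule "predict negative when the probability is below the threshold" are all assumptions. In the setting described above, it would be run separately for each age group.

```python
def npv_at(threshold, probs, labels):
    """NPV among cases predicted negative (prob < threshold); None if there are none."""
    negatives = [y for p, y in zip(probs, labels) if p < threshold]
    if not negatives:
        return None
    return negatives.count(0) / len(negatives)


def highest_threshold_with_npv(probs, labels, target=0.99):
    """Return the largest candidate threshold whose NPV meets the target, else None."""
    best = None
    for t in sorted(set(probs)):
        npv = npv_at(t, probs, labels)
        # NPV is not monotone in the threshold, so every candidate is checked.
        if npv is not None and npv >= target:
            best = t
    return best


# Toy predicted probabilities and outcomes (1 = ICH); NOT study data.
probs = [0.01, 0.02, 0.05, 0.10, 0.40, 0.80]
labels = [0, 0, 0, 1, 0, 1]

threshold = highest_threshold_with_npv(probs, labels, target=0.99)
print(threshold)
```

Preferring the largest qualifying threshold classifies as many patients as possible as negative while keeping the safety constraint, which is the trade-off between avoiding CTs and missing hemorrhages.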
Participants for decision simulation study. The participants were residents and specialists in the ED of the SMC, a single tertiary academic hospital in South Korea. We defined DEEPTICH effectiveness as a change in a head CT decision based on the DEEPTICH recommendation when the initial decision of the participant differed from the DEEPTICH recommendation. We calculated the mean and SD of DEEPTICH effectiveness for five emergency physicians, which were 51.0% and 10.6%, respectively. We derived the appropriate number of
Simulation cases selection. We conducted a simulation study using pediatric cases because the decision to order a head CT is more challenging in the pediatric population than in adult patients. From the SMC validation dataset, we selected 24 cases of pediatric patients who visited the ED after a fall. We stratified the cases based on the PECARN rule and patient age (Supplementary Table S3). This study focused on the effect of AI on the decision of the physician, not on its accuracy; therefore, we selected only cases in which the DEEPTICH result was the same as the patient outcome. We performed a trend test (Cochran-Armitage test) and Spearman correlation analysis to verify the validity of the simulated cases. For high-risk cases, we confirmed that the extent of the decision on head CT ordering was significant, and that the head CT ordering willingness on a five-point scale was high.
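The Cochran-Armitage test mentioned above checks whether a proportion (here, for example, the share of "order CT" decisions) rises or falls across ordered groups such as PECARN risk levels. The sketch below is a minimal two-sided implementation with a normal approximation; the counts are hypothetical, and the equally spaced scores 0, 1, 2 are an assumption.

```python
import math


def cochran_armitage(events, totals, scores=None):
    """Two-sided Cochran-Armitage trend test; returns (z, p_value).

    events[i] = number of positive decisions in ordered group i,
    totals[i] = number of decisions in group i.
    """
    scores = scores or list(range(len(totals)))
    n = sum(totals)
    p = sum(events) / n  # pooled proportion under the null of no trend
    t = sum(s * (r - m * p) for s, r, m in zip(scores, events, totals))
    var = p * (1 - p) * (
        sum(m * s * s for s, m in zip(scores, totals))
        - sum(m * s for s, m in zip(scores, totals)) ** 2 / n
    )
    z = t / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts of "order CT" decisions in low, intermediate, high risk groups.
z, p_value = cochran_armitage(events=[10, 30, 50], totals=[60, 60, 60])
print(round(z, 3), p_value)
```

A large positive z with a small p-value would indicate that ordering increases with risk level, which is the pattern expected if the simulated cases are valid.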
Four sub-questions were asked for each case. Sub-questions 1 and 2 were asked before the DEEPTICH recommendation, and sub-questions 3 and 4 were asked after the DEEPTICH recommendation. Sub-questions 1 and 3 were identical and asked for the binary head CT order decision. Sub-questions 2 and 4 were also identical and asked for the willingness to order a head CT on a five-point scale. The participants answered all four sub-questions. DEEPTICH presented three pieces of information: (1) the ICH probability of the case; (2) the top percentage of the ICH probability within the same age group; and (3) the head CT order decision from DEEPTICH. The process of the simulation scenario is shown in Textbox 1.
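The four sub-questions and the definition of DEEPTICH effectiveness map naturally onto a small record type. The sketch below is one possible representation, with hypothetical field and function names: a response is discordant when the initial decision differs from the recommendation, and it counts as accepted when the final decision then matches the recommendation.

```python
from dataclasses import dataclass


@dataclass
class Response:
    initial_ct: bool        # sub-question 1: order a head CT before the recommendation?
    initial_score: int      # sub-question 2: willingness, 1-5
    final_ct: bool          # sub-question 3: same question after the recommendation
    final_score: int        # sub-question 4: willingness, 1-5
    ai_recommends_ct: bool  # binary recommendation shown to the participant

    @property
    def discordant(self):
        """True when the initial decision differs from the recommendation."""
        return self.initial_ct != self.ai_recommends_ct

    @property
    def accepted(self):
        """True when a discordant decision was changed to follow the recommendation."""
        return self.discordant and self.final_ct == self.ai_recommends_ct


def effectiveness(responses):
    """Share of discordant responses that were changed; None if none are discordant."""
    disc = [r for r in responses if r.discordant]
    return sum(r.accepted for r in disc) / len(disc) if disc else None


# Three made-up responses: one accepted, one discordant but unchanged, one concordant.
responses = [
    Response(False, 2, True, 4, True),
    Response(True, 4, True, 4, False),
    Response(True, 5, True, 5, True),
]
print(effectiveness(responses))
```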

Survey development.
We performed a survey to investigate the factors affecting the DEEPTICH effectiveness. The survey consisted of five questions regarding general medical AI and seven questions regarding the AI used in this study (i.e., DEEPTICH).
Study process. Consent was obtained from all participants. The participants were provided with the model information and the PECARN rules. Using the PECARN rules was left to the discretion of the physician, and its frequency was not measured. We explained the characteristics and development of DEEPTICH and its clinical performance, i.e., the sensitivity, specificity, negative predictive value, and positive predictive value. Details of model development are in the Supplementary Method. The participants were asked to answer four sub-questions for each simulation case: two questions before the DEEPTICH recommendation and the same two questions after the DEEPTICH recommendation (Textbox 1). The simulation case was viewed in a Q-card format. After completing the 24 simulation cases, the participants responded to the survey.
Outcomes. The primary outcome was the change in head CT order binary decision when the initial binary decision was different from the DEEPTICH recommendation. The secondary outcome was the change in the
Case presentation (example)
A four-year-old boy arrived at the emergency room presenting with persistent headache after falling from a height of 1 m. The initial GCS score at ED was 15; however, he had a loss of consciousness for 1-2 minutes. A scalp hematoma was palpated on the right temporal area of the head.
At the end of 24 simulation cases: Respond to survey questionnaire
1) Do you order a head CT? Yes/No
2) How much do you think a head CT is necessary? (5-